Website to visit our Diabetes Prediction data set
Within this analysis, we’ll investigate factors correlated to diabetes. With a data set of 100,000 people, this investigation allows us to display relations between ages, HbA1c levels, smoking history, and glucose levels. With a wide range of data points, we begin to question if there are trends within this data that match our general understanding of diabetes. Our goal is to asses which of the 9 variables play a stronger role to the development of diabetes and if we can prove trends to better support our assumptions of this data. Through data visualization, chart analysis, and numerical analysis we will be able to present this data to convince a general audience of the important factors that contribute to diabetic trends.
library(tidyverse) # Loaded for dplyr
library(ggplot2) # Loaded for plotting
library(plotly) # Loaded for interactive plots
library(readr) # Loaded to read in data
library(knitr) # Loaded to compute and display data
library(scales) # Loaded to scale data
library(dplyr) # Loaded for data manipulation
diabetes_dataset <- read_csv("diabetes_prediction_dataset.csv", show_col_types = FALSE)
glimpse(diabetes_dataset)
## Rows: 100,000
## Columns: 9
## $ gender <chr> "Female", "Female", "Male", "Female", "Male", "Fem…
## $ age <dbl> 80, 54, 28, 36, 76, 20, 44, 79, 42, 32, 53, 54, 78…
## $ hypertension <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "No Info", "never", "current", "current",…
## $ bmi <dbl> 25.19, 27.32, 27.32, 23.45, 20.14, 27.32, 19.31, 2…
## $ HbA1c_level <dbl> 6.6, 6.6, 5.7, 5.0, 4.8, 6.6, 6.5, 5.7, 4.8, 5.0, …
## $ blood_glucose_level <dbl> 140, 80, 158, 155, 155, 85, 200, 85, 145, 100, 85,…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Gender: The biological sex of the individual.
Age: Refers to the age of the individual.
Hypertension: Indicates whether individuals have this condition.
Heart disease: Indicates whether individuals have this condition.
Smoking history: Identifies individual’s smoking history.
BMI: A measure of body fat based on weight and height.
HbA1c level: Refers to the measure of a person’s average blood sugar level over the past 2-3 months.
Blood glucose level: Refers to the amount of glucose in the bloodstream at a given time.
Diabetes: Variable being predicted, with values of 1 indicating the presence of diabetes and 0 indicating no presence.
Can we predict diabetes status based on blood sugar levels?
For our first plot we filtered our data to categorize males and females as diabetic, pre-diabetic, and normal based on blood sugar levels(HbA1c). To do this we selected the variables of interest which are gender, diabetes, and HbA1c_level. From the first table shown we can see that blood sugar levels not necessarily define the diabetes status of the individuals within our data set.
# select, filter, and mutate the columns I want to work with
mut_diabdat_for_BSL <- diabetes_dataset %>%
select(gender, diabetes, HbA1c_level) %>%
filter(gender != "Other") %>% # we want to focus on male and female only
mutate(HbA1c_category = case_when( # categorize diabetes status based on BSL
HbA1c_level < 5.7 ~ "Normal < 5.7%",
HbA1c_level >= 5.7 & HbA1c_level <= 6.4 ~ "Prediabetic 5.7% - 6.4%",
HbA1c_level >= 6.5 ~ "Diabetic ≥ 6.5%",
TRUE ~ NA_character_
))
kable(head(mut_diabdat_for_BSL, 5), caption = "99,982 x 4 (first 5 rows)")
| gender | diabetes | HbA1c_level | HbA1c_category |
|---|---|---|---|
| Female | 0 | 6.6 | Diabetic ≥ 6.5% |
| Female | 0 | 6.6 | Diabetic ≥ 6.5% |
| Male | 0 | 5.7 | Prediabetic 5.7% - 6.4% |
| Female | 0 | 5.0 | Normal < 5.7% |
| Male | 0 | 4.8 | Normal < 5.7% |
# count, group_by, and mutate to obtain the number and percentage for each diabetic category for M and F
get_perc_count <- mut_diabdat_for_BSL %>%
count(gender, HbA1c_category) %>% # new count col of the number of people within each category for M and F
group_by(gender) %>% # group by gender
mutate(percent = n / sum(n) * 100) # get percentage of population within each diabetic category
# create my barplot, add x and y labels as well as color and text details.
get_perc_count_plotly <- ggplot(get_perc_count, aes(x = gender, y = n, fill = HbA1c_category)) +
geom_bar(stat = "identity", position = "fill", color = "black", linewidth = 0.1, alpha = 0.5) +
geom_text(aes(label = paste0(n, " (", round(percent, 1), "%)")),
position = position_fill(vjust = 0.5), size = 3) +
scale_fill_manual(values = c("Normal < 5.7%" = "cornsilk2",
"Prediabetic 5.7% - 6.4%" = "darkkhaki",
"Diabetic ≥ 6.5%" = "darkgoldenrod4")) +
labs(title = "Male vs. Female Blood Sugar Levels (HbA1c)",
x = "Gender", y = "Proportion", fill = "HbA1c Category") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))+
theme(
plot.title = element_text(size = 13),
axis.text = element_text(size = 11),
)
# bar plot interactive using ggplotly
ggplotly(get_perc_count_plotly)
kable(head(get_perc_count, 5), caption = "6 x 4 (first 5 rows)") # showing the first 5 rows of the data set
| gender | HbA1c_category | n | percent |
|---|---|---|---|
| Female | Diabetic ≥ 6.5% | 11835 | 20.21280 |
| Female | Normal < 5.7% | 22492 | 38.41372 |
| Female | Prediabetic 5.7% - 6.4% | 24225 | 41.37348 |
| Male | Diabetic ≥ 6.5% | 8959 | 21.62443 |
| Male | Normal < 5.7% | 15358 | 37.06976 |
How is age related to diabetes, heart disease, and hypertension?
To create this density plot we created 2 different groups of individuals, the first group are individuals with all 3 conditions which are diabetes, heart disease, and hypertension while the second group is free from all 3 conditions. After separating the 2 groups we brought it together to create a density plot and show the distribution within age. From this graph we concluded that there is a relationship between age and these 3 conditions. From the graph we observe that older individuals tend to have all 3 conditions.
# show a density plot to display age distribution based on diabetes and heart disease
# I create 2 new variables based on my original data set, I select and filter the columns I want to work with
DHH_only <- diabetes_dataset %>% # individuals with all 3 conditions
select(age, diabetes, heart_disease, hypertension) %>%
filter(age >= 2, diabetes == 1, heart_disease == 1, hypertension == 1) %>%
mutate(group = "Diabetes, H.D, and Hyp.")
kable(head(DHH_only, 5), caption = "358 x 5 (first 5 rows)") #showing the table
| age | diabetes | heart_disease | hypertension | group |
|---|---|---|---|---|
| 57 | 1 | 1 | 1 | Diabetes, H.D, and Hyp. |
| 62 | 1 | 1 | 1 | Diabetes, H.D, and Hyp. |
| 62 | 1 | 1 | 1 | Diabetes, H.D, and Hyp. |
| 67 | 1 | 1 | 1 | Diabetes, H.D, and Hyp. |
| 72 | 1 | 1 | 1 | Diabetes, H.D, and Hyp. |
No_DHH_only <- diabetes_dataset %>% # individuals free from all 3 conditions
select(age,heart_disease, diabetes, hypertension) %>%
filter(age >= 2, diabetes == 0, heart_disease == 0, hypertension ==0) %>%
mutate(group = "Free of Diabetes, H.D, and Hyp.")
kable(head(No_DHH_only, 5), caption = "81,885 x 5 (first 5 rows)") #showing the table
| age | heart_disease | diabetes | hypertension | group |
|---|---|---|---|---|
| 54 | 0 | 0 | 0 | Free of Diabetes, H.D, and Hyp. |
| 28 | 0 | 0 | 0 | Free of Diabetes, H.D, and Hyp. |
| 36 | 0 | 0 | 0 | Free of Diabetes, H.D, and Hyp. |
| 20 | 0 | 0 | 0 | Free of Diabetes, H.D, and Hyp. |
| 79 | 0 | 0 | 0 | Free of Diabetes, H.D, and Hyp. |
# create the density plot, I name my x and y labels, title, color, fill, and plot size details
interactive_DHH <- ggplot() +
geom_density(data = DHH_only, aes(x = age, fill = group), alpha = 0.5) + # Diabetes cases
geom_density(data = No_DHH_only , aes(x = age, fill = group), alpha = 0.5) +
labs(title = "Age Distribution: Diabetes vs. Heart Disease",
x = "Age",
y = "Density") +
scale_x_continuous(breaks = seq(2, 80, by = 6)) +
scale_fill_manual(values = c("Diabetes, H.D, and Hyp." = "#660033",
"Free of Diabetes, H.D, and Hyp." = "#66CCFF")) + # Custom colors
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(
plot.title = element_text(size = 13),
axis.text = element_text(size = 11),
)
# here I make the bar plot interactive using ggplotly
ggplotly(interactive_DHH)
Is there a correlation between body mass index and hypertension status?
To create this plot, we focused on the BMI and Hypertension variables from our data set. 0 representing no hypertension and 1 representing hypertension. From this graph we concluded that there is a correlation between BMI and hypertension status. We observe that individuals with hypertension tend to have higher BMI values compared to those without hypertension.
kable(head(diabetes_dataset, 5), caption = "10,000 x 9 (first 5 rows)") # display the diabetes_dataset table
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|
| Female | 80 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| Female | 54 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| Male | 28 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| Female | 36 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| Male | 76 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |
overall_median_bmi <- median(diabetes_dataset$bmi, na.rm = TRUE) # new variable for the median of bmi to obtain horizontal red line
#create a violin plot, hypertension is selected as our x variable and bmi as our y variable
BMI_distribution <- ggplot(diabetes_dataset, aes(x = factor(hypertension), y = bmi, fill = factor(hypertension))) + # use factor so that hypertension can be treated as a factor
geom_violin(alpha = 0.6) +
geom_hline(yintercept = overall_median_bmi, linetype = "solid", color = "red", linewidth = 0.3) + #Line marks a 27 BMI which is classified as an overweight value
scale_fill_manual(values = c("azure3", "#CC6600")) +
scale_y_continuous(breaks = seq(10.01, 95.69, by = 5)) +
labs(title = "BMI Distribution by Hypertension Status",
x = "Hypertension Status (0 = No, 1 = Yes)",
y = "BMI",
fill = "Hypertension") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))+
theme(
plot.title = element_text(size = 13),
axis.text = element_text(size = 11),
)
# make our plot interactive with ggplotly
ggplotly(BMI_distribution)
What’s the correlation between blood glucose levels and diabetes?
In this box plot we are comparing blood glucose levels by diabetes status, focusing on individuals within ages 3-80 because the youngest individual with diabetes is 3 years old and the oldest is 80. . From this plot we conclude that there is a strong correlation between both variables. From this graph we observe that individuals with higher blood glucose levels tend to fall within the diabetes status.
# these are the variables I worked with to create this table
table_diabetes_no_diabetes <- diabetes_dataset %>%
select(age,diabetes, blood_glucose_level) %>%
filter(age >= 3) %>%
mutate(diabetes = factor(diabetes, levels = c(0, 1), labels = c("No Diabetes", "Diabetes")))
# here I show the table
kable(head(table_diabetes_no_diabetes, 5), caption = "96,713 x 3 (first 5 rows)")
| age | diabetes | blood_glucose_level |
|---|---|---|
| 80 | No Diabetes | 140 |
| 54 | No Diabetes | 80 |
| 28 | No Diabetes | 158 |
| 36 | No Diabetes | 155 |
| 76 | No Diabetes | 155 |
diabetes_and_no_diabetes_plot <- diabetes_dataset %>%
select(blood_glucose_level,diabetes, age) %>%
filter(age >= 3) %>%
mutate(diabetes = factor(diabetes, levels = c(0, 1), labels = c("No Diabetes", "Diabetes"))) %>%
ggplot(aes(x = diabetes, y = blood_glucose_level, fill = diabetes)) +
geom_boxplot(alpha = 0.5) +
labs(title = "Blood Glucose Levels by Diabetes Status (age 3-80)",
x = "Diabetes Status",
y = "Blood Glucose Level") +
scale_fill_manual(values = c("Diabetes" = "#CC3300", "No Diabetes" = "#99CCFF")) +
labs(fill = "Status") +
scale_y_continuous(breaks = seq(80, 300, by = 20)) +
geom_hline(yintercept = c(120, 140), linetype = "solid", color = "#0033FF", size = 0.2) +
geom_hline(yintercept = c(200), linetype = "solid", color = "#CC3300", size = 0.3) + # Mark High Range
theme_void() +
theme(plot.title = element_text(hjust = 0.5))+
theme(
plot.title = element_text(size = 13),
axis.text = element_text(size = 11),
)
# make our plot interactive with ggplotly
ggplotly(diabetes_and_no_diabetes_plot)
Is BMI associated with diabetes, heart disease, and age across out data?
In this scatter plot we are comparing the age and BMI of 2 different groups of individuals; the first group is the group of individuals with diabetes only and the second group are individuals with heart disease. To achieve these results, we created 2 different tables for each group, then we combined the data to create our plot. In the graph we marked a red horizontal line representing the most common BMI for both groups. We concluded that there is a strong correlation between both conditions, BMI, and age. Higher BMI and older age are related to both conditions. We detected a high concentration of individuals who are 80 years old. This could be because diabetes and heart disease tend to be more common among older adults, besides that we discovered that our data set includes a disproportionately high number of individuals within the age of 80.
#diabetes_dataset
# here I create 2 variables that represent the data frames I'll be working on
diabetes_only <- diabetes_dataset %>%
select(age, bmi, diabetes) %>%
filter(age >= 2, diabetes == 1) %>%
mutate(condition = "Diabetes Only")
# here is a display the diabetes_only data frame
kable(head(diabetes_only, 5), caption = "8,500 x 4 (first 5 rows)")
| age | bmi | diabetes | condition |
|---|---|---|---|
| 44 | 19.31 | 1 | Diabetes Only |
| 67 | 27.32 | 1 | Diabetes Only |
| 50 | 27.32 | 1 | Diabetes Only |
| 73 | 25.91 | 1 | Diabetes Only |
| 53 | 27.32 | 1 | Diabetes Only |
heart_disease_only <- diabetes_dataset %>%
select(age, bmi, heart_disease) %>%
filter(age >= 2, heart_disease == 1) %>%
mutate(condition = "Heart Disease Only")
# here is a display the heart_disease_only data frame
kable(head(heart_disease_only, 5), caption = "3,942 x 4 (first 5 rows)")
| age | bmi | heart_disease | condition |
|---|---|---|---|
| 80 | 25.19 | 1 | Heart Disease Only |
| 76 | 20.14 | 1 | Heart Disease Only |
| 72 | 27.94 | 1 | Heart Disease Only |
| 67 | 27.32 | 1 | Heart Disease Only |
| 77 | 32.02 | 1 | Heart Disease Only |
# next, I combine the 2 data frames to create a scatter plot
combined_dataaa <- bind_rows(diabetes_only, heart_disease_only)
# here I want to show that there is a common BMI between individuals with diabetes only and individuals with heart disease only, I will show this by drawing a red horizontal line across the plot
common_bmi <- combined_dataaa %>%
count(bmi) %>% # count bmi
arrange(desc(n)) # sort by highest count
most_common_bmi <- common_bmi$bmi[1] # extract the most frequent bmi to mark a red horizontal line
ggplot(combined_dataaa) +
geom_point(aes(x = age, y = bmi, color = condition), alpha = 0.3) + # Scatter plot with BMI
geom_jitter(aes(x = age, y = bmi, color = condition), width = 0.1, height = 0.1, alpha = 0.3) +
geom_hline(aes(yintercept = 27.32, color = "Common BMI = 27.32"), linetype = "solid", size = 0.3) + # horizontal line marking common BMI
scale_color_manual(
values = c("Diabetes Only" = "cornflowerblue",
"Heart Disease Only" = "darkorchid4",
"Common BMI = 27.32" = "red"),
labels = c("Common BMI = 27.32", "Diabetes Only", "Heart Disease Only")) +
scale_x_continuous(breaks = seq(0, 80, by = 5)) +
scale_y_continuous(breaks = seq(10.01, 95.69, by = 10)) +
labs(title = "BMI vs. Age Across Diabetes & Heart Disease",
x = "Age",
y = "BMI",
color = "Condition") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(
axis.text = element_text(size = 11),
legend.text = element_text(size = 11),
plot.title = element_text(size = 13)
)
Each person within this scale has heart disease. Here a comparison is made between declared underweight and overweight people, grouped by sex, based on a BMI scale. There’s a significant increase in population percentage for those who are considered overweight and that have heart disease. With visual aid, it can be concluded that as weight increases, chances of heart disease will increase.
library(ggplot2)
library(dplyr)
# categories
bmi_heart_gender <- diabetes_dataset %>%
filter(heart_disease == 1) %>%
mutate(
bmi_category = case_when(
bmi <= 19 ~ "Underweight",
bmi >= 30 ~ "Overweight",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(bmi_category)) %>%
group_by(gender, bmi_category) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(gender) %>%
mutate(percentage = round(100 * count / sum(count), 1)) %>%
ungroup()
bmi_heart_gender$bmi_category <- factor(bmi_heart_gender$bmi_category,
levels = c("Underweight", "Overweight"))
max_ct <- max(bmi_heart_gender$count)
y_breaks <- pretty(c(0, max_ct), n = 10)
#plot
ggplot(bmi_heart_gender,
aes(x = bmi_category,
y = count,
group = gender,
color = gender)) +
geom_line(position = position_dodge(width = 0.5), linewidth = 1) +
geom_point(position = position_dodge(width = 0.5), size = 5) +
geom_text(aes(label = paste0(count, " (", percentage, "%)")),
vjust = -1,
position = position_dodge(width = 0.5)) +
scale_y_continuous(
breaks = y_breaks,
expand = expansion(mult = c(0, .05)) # so labels don’t get clipped
) +
labs(
title = "Heart Disease Cases by Gender and BMI Category",
x = "BMI Category",
y = "Count of People with Heart Disease",
color = "Gender"
) +
theme_minimal()
The data here is heavily dependent on BMI scale. It is important to note that BMI is not really a great determination for those who have diabetes, but there is a general trend within the data that people who have a BMI over 30 are more likely to be diabetic. In this graph we see that the percentages are that of people who are considered overweight and do not have diabetes. So, for the 18% represented of overweight men without diabetes there is 82% of overweight men with diabetes. The same logic applies to the data related to females.
library(dplyr)
library(ggplot2)
male_data = diabetes_dataset %>% filter(gender == "Male")
female_data = diabetes_dataset %>% filter(gender == "Female")
# Summarise men
men_summary <- male_data %>%
summarise(
count = sum(bmi >= 30 & diabetes == 0),
total = n()
) %>%
mutate(
sex = "Men",
percent = count / total * 100
)
# Summarise women
women_summary <- female_data %>%
summarise(
count = sum(bmi >= 30 & diabetes == 0),
total = n()
) %>%
mutate(
sex = "Women",
percent = count / total * 100
)
# Combine into one data frame
men_women_df = bind_rows(men_summary, women_summary)
#plot
ggplot(men_women_df, aes(x = sex, y = count, fill = sex)) +
geom_col(width = 0.6) +
geom_text(aes(label = sprintf("%d (%.1f%%)", count, percent)),
vjust = -0.5, size = 4) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(
title = "Count of Overweight (BMI ≥ 30) without diabetes, by Sex",
x = NULL,
y = "Number of Individuals",
fill = NULL
) +
theme_minimal(base_size = 14) +
theme(legend.position = "none")
This depicts the different categories of HbA1c levels and their relation to a patients hypertension status
library(dplyr)
library(ggplot2)
library(scales)
#creating variables and assigning to dataset
df_summary <- diabetes_dataset %>%
mutate(
HbA1c_cat = case_when(
HbA1c_level < 5.7 ~ "< 5.7 (non diabetic)",
HbA1c_level >= 5.7 & HbA1c_level < 6.5 ~ "5.7–6.4 (prediabetic)",
HbA1c_level >= 6.5 ~ "≥ 6.5 (diabetic)"
),
hypertension = factor(hypertension, levels = c(0,1), #creating hypertension to a factor
labels = c("No", "Yes"))
) %>%
group_by(HbA1c_cat, hypertension) %>%
summarise(n = n(), .groups = "drop_last") %>%
mutate(percent = n / sum(n)) # groups by HbA1c_cat
# plot
ggplot(df_summary, aes(x = HbA1c_cat,
y = percent,
fill = factor(hypertension))) +
geom_col(position = "stack") +
geom_text(aes(label = scales::percent(percent, 1)),
position = position_stack(vjust = 0.5)) +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_discrete(name = "Hypertension",
labels = c("No", "Yes")) +
labs(
title = "Hypertension Status by HbA1c Category",
x = "HbA1c Category",
y = "Percent within Category"
) +
theme_light()
This graph shows the population density of men based on diabetes status, based on age range. Within our data, it is seen that there is a peak seen where a large population of men are diagnosed with diabetes around the age of 65. It is also seen that diabetes occurs for a significant population starting around the age of 20.
library(ggplot2)
# assigning dataset
df_male <- diabetes_dataset %>%
filter(gender == "Male") %>%
mutate(diabetes = factor(diabetes, #making diabetes status as a factor
levels = c(0,1),
labels = c("Non‑diabetic","Diabetic")))
# ggplot code
male_diabetic_dens = ggplot(df_male, aes(x = age, fill = diabetes)) +
geom_density(alpha = 0.5) +
scale_fill_manual(
name = "Diabetes status",
values = c("Non‑diabetic" = "lightpink",
"Diabetic" = "#301934")
) +
labs(
title = "Age Distribution in Male Subjects",
x = "Age",
y = "Density"
) +
theme_minimal()
ggplotly(male_diabetic_dens)
This graph shows the population density of women based on diabetes status, based on age range. Within our data, it is seen that there is a peak seen where a large population of women are diagnosed with diabetes around the age of 65. It is also seen that diabetes occurs for a significant population starting around the age of 30.
library(ggplot2)
# assigning dataset
df_female <- diabetes_dataset %>%
filter(gender == "Female") %>%
mutate(diabetes = factor(diabetes, #making diabetes status as a factor
levels = c(0,1),
labels = c("Non‑diabetic","Diabetic")))
# ggplot code
female_diabetic_dens = ggplot(df_female, aes(x = age, fill = diabetes)) +
geom_density(alpha = 0.5) +
scale_fill_manual(
name = "Diabetes status",
values = c("Non‑diabetic" = "lightpink",
"Diabetic" = "#309")
) +
labs(
title = "Age Distribution in Female Subjects",
x = "Age",
y = "Density"
) +
theme_minimal()
ggplotly(female_diabetic_dens)
In the smoking data there are 6 unique values
The total amount of people who fall into each category is as follows;
Never: 35095
Not current: 6447
Former: 9352
Current: 9286
Ever: 4004
No Info: 35816
There is quite a sizable amount of people in the No info category.
The total number of people in the dataset is 100000. To help clean up the data, we can filter ‘No Info’ people out. When we do that we get 64184.
# Figure out the unique categories of smoking history
#unique(diabetes_dataset$smoking_history)
# Count amount of people who belong to each unique smoking category
#Omit No info
smoking_tally <- diabetes_dataset %>% filter(smoking_history != 'No Info') %>% group_by(smoking_history) %>% summarise(total_people = n())
The data was then summarized to gather the total counts belonging to each smoking category and further grouped by diabetes status.
A percentage per smoking category with diabetes is then calculated dividing the count with diabetes by the total count in each smoking category.
#group diabetic vs non diabetic people together
smoking_diabetes_dataset <- diabetes_dataset %>%
filter(smoking_history != 'No Info') %>%
group_by(smoking_history, diabetes) %>%
summarise(total = n())
## `summarise()` has grouped output by 'smoking_history'. You can override using
## the `.groups` argument.
# Inner join tally data with diabetic grouped data;
#mutate a column to create a percentage per category;
#select desired columns
smoking_diabetes_percentage <- inner_join(smoking_tally, smoking_diabetes_dataset, by = 'smoking_history') %>% mutate(Percentage = total/total_people) %>% select(smoking_history, diabetes, total, Percentage)
Now we can graph the relationship between smoking and diabetes as separated by smoking category.
#Create initial graph about smoking and diabetes
smoking_diabetes_graph <- ggplot(smoking_diabetes_percentage,
aes(x = smoking_history,
y = total,
fill = factor(diabetes))) +
geom_col(position = position_dodge(width = 0.9)) +
# add percent labels
geom_text(aes(label = percent(Percentage, accuracy = .1)),
position = position_dodge(width = 1),
vjust = .4, # nudge above bar
size = 3,
hjust = -.1) +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
labs(
title = "Smoking History and Diabetes Relationship",
x = "Smoking History",
y = "Count of People",
fill = "Has Diabetes",
caption = "1 = diabetes, 0 = no diabetes"
) +
scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#ffbc42"))
smoking_diabetes_graph
The graph reveals that 17% of former smokers have diabetes, a significantly higher percentage than all other categories. Never smokers have the lowest percentage, with 9.5% having diabetes, and the other 3 categories range between 10-12%. This shows that former smokers are the most likely to have diabetes compared to all other smoking history categories.
Never: 9.5%
Not current: 10.7%
Former: 17%
Current: 10.2%
Ever: 11.8%
This data also included information about heart disease status. The same filters and groupings applied to the diabetes and smoking data was applied with heart disease as the focus.
smoking_heart_disease_data <- diabetes_dataset %>%
filter(smoking_history != 'No Info') %>%
group_by(smoking_history, heart_disease) %>%
summarise(total = n())
## `summarise()` has grouped output by 'smoking_history'. You can override using
## the `.groups` argument.
smoking_heart_disease_percentage <- inner_join(smoking_tally, smoking_heart_disease_data, by = 'smoking_history') %>% mutate(Percentage = round( total/total_people, digits = 4)) %>% select(smoking_history, heart_disease, total, Percentage)
smoking_heart_disease_graph <- ggplot(smoking_heart_disease_percentage,
aes(x = smoking_history,
y = total,
fill = factor(heart_disease))) +
geom_col(position = position_dodge(width = 0.9)) +
# add percent labels
geom_text(aes(label = percent(Percentage, accuracy = .1)),
position = position_dodge(width = 1),
vjust = .4, # nudge above bar
size = 3,
hjust = -.1) +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
labs(
title = "Smoking History and Heart Disease Relationship",
x = "Smoking History",
y = "Count of People",
fill = "Has Heart Disease",
caption = "1 = Heart Disease, 0 = No Heart Disease"
) +
scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#577590"))
smoking_heart_disease_graph
The graph reveals a similar relationship, but at much lower percentages. Never smokers once again are at the lowest risk for heart disease at 3.1%, while former smokers sit at 9.7%. Not Current and Current are both at a similar percentage for heart disease, while we see a spike in ever smokers.
Never: 3.1%
Not current: 4.5%
Former: 9.7%
Current: 4.4%
Ever: 7.8%
The lower percentages indicate less correlation between smoking and heart disease, however the trend of former smokers having higher risk for disease still exists.
The data also contained information on hypertension status. Once again, the same filtering and grouping methods were applied to the dataset using the hypertension column.
smoking_hypertension_data <- diabetes_dataset %>%
filter(smoking_history != 'No Info') %>%
group_by(smoking_history, hypertension) %>%
summarise(total = n())
## `summarise()` has grouped output by 'smoking_history'. You can override using
## the `.groups` argument.
smoking_hypertension_percentage <- inner_join(smoking_tally, smoking_hypertension_data, by = 'smoking_history') %>% mutate(Percentage = round( total/total_people, digits = 4)) %>% select(smoking_history, hypertension, total, Percentage)
smoking_hypertension_graph <- ggplot(smoking_hypertension_percentage,
aes(x = smoking_history,
y = total,
fill = factor(hypertension))) +
geom_col(position = position_dodge(width = 0.9)) +
# add percent labels
geom_text(aes(label = percent(Percentage, accuracy = .1)),
position = position_dodge(width = 1),
vjust = .4, # nudge above bar
size = 3,
hjust = -.1) +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
labs(
title = "Smoking History and Hypertension Relationship",
x = "Smoking History",
y = "Count of People",
fill = "Has Hypertension",
caption = "1 = Hypertension, 0 = No Hypertension"
) +
scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#f95d6a"))
smoking_hypertension_graph
The hypertension category reveals something a little different than the trend, as not current smokers have the lowest percentage of hypertension. Current and Never smokers are at essentially the same percentage, with current smokers having a slightly slower risk. Similar to the heart disease data, former smokers have the highest percentage of hypertension, and ever smokers following.
Never: 9.1%
Not current: 7.6%
Former: 14.3%
Current: 9.0%
Ever: 10.5%
When looking at the data, there is a spike in former smokers risk for the three health complications discussed.
Questions arose; If you never quit smoking, do you maintain similar risk to people who never have smoked? What changes from current to former smokers?
To have been classified a former smoker, you at one point had to have been a current smoker; which signifies a change in age from current to former smoking status.
A density plot of the age range by smoking category is created to visualize the ages where each smoking category range from.
smoking_data <- diabetes_dataset %>% filter(smoking_history != "No Info")
smoking_age_density <- ggplot(smoking_data, aes(x = age, color = as.factor(smoking_history))) + geom_density() +
scale_x_continuous(breaks = seq(0, 80, by = 5),
limits = c(0, 80)) +
labs(
title = "Smoking History Across Ages",
x = "Age",
color = "Smoking History"
)
smoking_age_density
The density plot shows an increase in former smokers from the 40-60 age range, while current smokers begin to decrease at around age 47. Similarly we see a decrease in all other categories at the 40-60 age range.
We can compare this with the density of people with diabetes, hypertension, and heart disease across all ages to see if there are similar spikes.
#AGE VS DIABETES
age_vs_diabetes <- ggplot(diabetes_dataset,
aes(x = age, fill = as.factor(diabetes))
) +
scale_x_continuous(breaks = seq(0, 80, by = 5),
limits = c(0, 80)) +
geom_density(alpha = .5) +
labs(
title = "Age and Diabetes",
x = "Age",
fill = "Diabetes",
caption = "1 = Diabetes, 0 = No Diabetes"
) +
scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#ffbc42"))
#AGE VS HYPERTENSION
age_vs_hypertension <- ggplot(diabetes_dataset,
aes(x = age, fill = as.factor(hypertension))
) +
geom_density(alpha = .5) +
scale_x_continuous(breaks = seq(0, 80, by = 5),
limits = c(0, 80)) +
labs(
title = "Age and Hypertension",
x = "Age",
fill = "Hypertension",
caption = "1 = Hypertension, 0 = No Hypertension"
) +
scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#f95d6a"))
# AGE VS HEART DISEASE
age_vs_heart_disease <- ggplot(smoking_data,
aes(x = age, fill = as.factor(heart_disease))) +
geom_density(alpha = .5) +
scale_x_continuous(breaks = seq(0, 80, by = 5),
limits = c(0, 80)) +
labs(
title = "Age and Heart Disease",
x = "Age",
fill = "Heart Disease",
caption = "1 = Heart Disease, 0 = No Heart Disease"
) +
scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#577590"))
age_vs_heart_disease
age_vs_hypertension
With diabetes, heart disease, and hypertension, we see a increase at a similar age range to the former smokers spike, from 40-60. This also highlights how age factors into diabetic health risk. The density plots of age and the three health issues reveal a relationship between age; the older you get, the more likely you are to have diabetes, heart disease, and hypertension, with a big spike once you are over 40 years old.
Former smokers are at a higher risk for diabetes, hypertension, and heart disease
As people get older, their risk for disease increases
Although current smokers risk is not reflected through our percentage graph, further digging shows that around the age where current smokers decrease and former smokers increase is around the same age range that the risk for disease increases
Within this analysis we’ve scaled:
Diabetes status based by age and gender
Relations towards BMI and Hypertension status
Shared populations between Diabetes, heart disease, and hypertension status
HbA1c trends among gender
Glucose levels dependent by diabetes status
Smoking status and its relation to Diabetes, heart disease, and hypertension
From this we were able to to effectively show and defend our general assumptions about diabetes and its co factors
This data is was gathered from private medical records. Because of confidentiality, the gathering methods of the data were not clear.
This data does not indicate whether our diabetic patients are type 1 or type 2 diabetic
This data is generated from various studies, making up the 100,000 patients
Blood glucose levels were not detailed on how this data was retrieved. Methods of measuring blood glucose levels vary from multiple hour fasts to immediately after a meal, each of which having different ranges to indicate diabetes.
Smoking History had a sizable number of omissions due to no info being gathered. Also, some of the categories are unclear in their definitions. An example of this is the not current category. Could they have been included in the former category, or if they are only abstaining for a set time period would they still be considered current?
The maximum age in the data is 80, however spikes at the age 80 range were noted. There may have been a moment where all data collected of ages older than 80 were assigned the age 80. However, we cannot be sure as no info was given.